Peeter Piksarv (piksarv .at. gmail.com)
The latest version of this Jupyter notebook is available at https://github.com/ppik/playdata/tree/master/Kaggle-Expedia
This is my take on the Expedia Hotel Recommendations Kaggle competition, started off using the Dataquest tutorial by Vik Paruchuri.
In [1]:
import itertools
import operator
import random
import matplotlib.pyplot as plt
import ml_metrics as metrics
import numpy as np
import pandas as pd
import sklearn
import sklearn.cross_validation  # used below for cross_val_score and KFold
import sklearn.decomposition
import sklearn.ensemble
%matplotlib notebook
There's actually no need to unpack the gzipped csv files: pandas' read_csv can handle them directly, although it can be slower (reading 1,000,000 rows from train.csv.gz seems to be about 9% slower than from train.csv on my laptop).
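For reference, a minimal sketch of how such a timing comparison could be done (assuming both data/train.csv and data/train.csv.gz are present; the repeat count is just illustrative):

import timeit

# Read 1,000,000 rows from the plain and the gzipped file;
# pandas infers the compression from the file extension.
for fn in ['data/train.csv', 'data/train.csv.gz']:
    t = timeit.timeit(lambda: pd.read_csv(fn, nrows=1000000), number=3)
    print(fn, t / 3)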
Additionally, it's a good idea to specify the data types for each column to ease the memory requirements. By default pandas detects the following data types:
In [2]:
train = pd.read_csv('data/train.csv.gz', nrows=10)
train.info()
According to the specification, the data fields are as follows:
train.csv
Column name | Description | Data type | Equiv. type | Notes |
---|---|---|---|---|
date_time | Timestamp | string | [1] | |
site_name | ID of the Expedia point of sale | int | np.int32 | |
posa_continent | ID of continent associated with site_name | int | np.int32 | |
user_location_country | The ID of the country the customer is located in | int | np.int32 | |
user_location_region | The ID of the region the customer is located in | int | np.int32 | |
user_location_city | The ID of the city the customer is located in | int | np.int32 | |
orig_destination_distance | Physical distance between a hotel and a customer at the time of search. A null means the distance could not be calculated | double | np.float64 | |
user_id | ID of user | int | np.int32 | |
is_mobile | 1 when a user connected from a mobile device, 0 otherwise | tinyint | np.uint8 | [2] |
is_package | 1 if the click/booking was generated as a part of a package (i.e. combined with a flight), 0 otherwise | int | np.uint8 | [2] |
channel | ID of a marketing channel | int | np.int32 | |
srch_ci | Checkin date | string | [1] | |
srch_co | Checkout date | string | [1] | |
srch_adults_cnt | The number of adults specified in the hotel room | int | np.int32 | |
srch_children_cnt | The number of (extra occupancy) children specified in the hotel room | int | np.int32 | [4] |
srch_rm_cnt | The number of hotel rooms specified in the search | int | np.int32 | [4] |
srch_destination_id | ID of the destination where the hotel search was performed | int | np.int32 | |
srch_destination_type_id | Type of destination | int | np.int32 | |
hotel_continent | Hotel continent | int | np.int32 | |
hotel_country | Hotel country | int | np.int32 | |
hotel_market | Hotel market | int | np.int32 | |
is_booking | 1 if a booking, 0 if a click | tinyint | np.uint8 | [2] |
cnt | Number of similar events in the context of the same user session | bigint | np.int64 | |
hotel_cluster | ID of a hotel cluster | int | np.int32 | |
destinations.csv
Column name | Description | Data type | Equiv. type | Notes |
---|---|---|---|---|
srch_destination_id | ID of the destination where the hotel search was performed | int | np.int32 | |
d1-d149 | latent description of search regions | double | np.float64 | [3,5] |
In [2]:
traincols = ['date_time', 'site_name', 'posa_continent', 'user_location_country',
'user_location_region', 'user_location_city', 'orig_destination_distance',
'user_id', 'is_mobile', 'is_package', 'channel', 'srch_ci', 'srch_co',
'srch_adults_cnt', 'srch_children_cnt', 'srch_rm_cnt', 'srch_destination_id',
'srch_destination_type_id', 'is_booking', 'cnt', 'hotel_continent',
'hotel_country', 'hotel_market', 'hotel_cluster']
testcols = ['id', 'date_time', 'site_name', 'posa_continent', 'user_location_country',
'user_location_region', 'user_location_city', 'orig_destination_distance',
'user_id', 'is_mobile', 'is_package', 'channel', 'srch_ci', 'srch_co',
'srch_adults_cnt', 'srch_children_cnt', 'srch_rm_cnt', 'srch_destination_id',
'srch_destination_type_id', 'hotel_continent', 'hotel_country', 'hotel_market']
Finding columns in testcols but not in traincols and vice versa:
In [4]:
[col for col in testcols if col not in traincols]
Out[4]:
['id']
In [5]:
[col for col in traincols if col not in testcols]
Out[5]:
['is_booking', 'cnt', 'hotel_cluster']
I don't know exactly which data columns I will eventually be using, but I will define the data types for them here anyway, just in case. Looking at the data, most of the columns are actually non-negative integers, so I can use unsigned integers in most cases. The choice between uint8, uint16, uint32, and others was determined by the min and max values in the test dataset.
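As a sanity check for these dtype choices, the observed min and max of each numeric column can be inspected on a sample read with pandas' defaults (a sketch; a full scan could be done the same way in chunks):

sample = pd.read_csv('data/test.csv.gz', nrows=1000000)
sample.describe().loc[['min', 'max']].T  # observed ranges per numeric column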
In [3]:
def read_csv(filename, cols, nrows=None):
datecols = ['date_time', 'srch_ci', 'srch_co']
dateparser = lambda x: pd.to_datetime(x, format='%Y-%m-%d %H:%M:%S', errors='coerce')
dtypes = {
'id': np.uint32,
'site_name': np.uint8,
'posa_continent': np.uint8,
'user_location_country': np.uint16,
'user_location_region': np.uint16,
'user_location_city': np.uint16,
'orig_destination_distance': np.float32,
'user_id': np.uint32,
'is_mobile': bool,
'is_package': bool,
'channel': np.uint8,
'srch_adults_cnt': np.uint8,
'srch_children_cnt': np.uint8,
'srch_rm_cnt': np.uint8,
'srch_destination_id': np.uint32,
'srch_destination_type_id': np.uint8,
'is_booking': bool,
'cnt': np.uint64,
'hotel_continent': np.uint8,
'hotel_country': np.uint16,
'hotel_market': np.uint16,
'hotel_cluster': np.uint8,
}
df = pd.read_csv(
filename,
nrows=nrows,
usecols=cols,
dtype=dtypes, # dtype can also specify data types for columns that do not exist in the particular data file
parse_dates=[col for col in datecols if col in cols], # columns here must be also in usecols
date_parser=dateparser,
)
return df
In [5]:
train = read_csv('data/train.csv.gz', nrows=None, cols=traincols)
train.info()
With these type definitions the entire training set of 37 million entries takes 2.3 GB of memory.
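The figure can be double-checked explicitly (deep=True would also count the contents of any object-dtype columns):

train.memory_usage(deep=True).sum() / 2**30  # total size in GiB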
In [6]:
test = read_csv('data/test.csv.gz', cols=testcols)
test.info()
Finding missing values in test data:
In [9]:
test.isnull().sum()
Out[9]:
There are also some entries where the check-in date is later than the check-out date:
In [10]:
(test.srch_ci > test.srch_co).sum()
Out[10]:
Checking that all of the user_ids in the test set are contained in the training set:
In [7]:
test_ids = set(test.user_id.unique())
train_ids = set(train.user_id.unique())
test_ids <= train_ids # issubset
Out[7]:
True
However, not all user_ids that are in the training data appear in the test data:
In [12]:
len(train_ids - test_ids)
Out[12]:
Extract month and year fields from the date:
In [8]:
train['month'] = train['date_time'].dt.month.astype(np.uint8)
train['year'] = train['date_time'].dt.year.astype(np.uint16)
Pick 10,000 users for smaller-scale testing:
In [12]:
sel_user_ids = sorted(random.sample(train_ids, 10000))
sel_train = train[train.user_id.isin(sel_user_ids)]
Create new test and training sets
In [13]:
t1 = sel_train[((sel_train.year == 2013) | ((sel_train.year == 2014) & (sel_train.month < 8)))]
t2 = sel_train[((sel_train.year == 2014) & (sel_train.month >= 8))]
Remove click events from t2, as in the original test data.
In [14]:
t2 = t2[t2.is_booking == True]
In [57]:
most_common_clusters = list(train.hotel_cluster.value_counts().head().index)
Predicting most_common_clusters for every single row in the selected test data.
In [13]:
predictions = [most_common_clusters for i in range(len(t2))]
Calculating Mean Average Precision with mapk from ml_metrics.
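As a toy illustration of what mapk returns (not competition data): with a single query whose true cluster is 1 and a prediction list where 1 shows up at rank 2, the average precision at k=5 is 1/2.

metrics.mapk([[1]], [[0, 1, 2]], k=5)  # first hit at rank 2 -> 0.5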
In [14]:
target = [[l] for l in t2['hotel_cluster']]
metrics.mapk(target, predictions, k=5)
Out[14]:
That's not too great.
In [20]:
#train.corr()['hotel_cluster']
# Calculating the correlations takes a while, and no linear correlations were found in the tutorial anyway.
In [25]:
dest = pd.read_csv('data/destinations.csv.gz')
dest.info()
In [26]:
dest.head()
Out[26]:
In [27]:
pca = sklearn.decomposition.PCA(n_components=3)
dest_small = pca.fit_transform(dest[['d{}'.format(i) for i in range(1,150)]])
dest_small = pd.DataFrame(dest_small)
dest_small['srch_destination_id'] = dest['srch_destination_id']
In [46]:
dest_small.head()
Out[46]:
The fraction of variance retained by principal component analysis with 3 components:
In [49]:
sum(pca.explained_variance_ratio_)
Out[49]:
The following function generates new date features based on date_time, srch_ci, and srch_co, drops the non-numeric date columns themselves, and adds in the features from dest_small. The tutorial replaces any missing values with -1. (I initially planned to use unsigned integers for most of the variables; using -1 as a fill value would not work then. May test replacing NAs with the most common values instead.)
In [62]:
def calc_fast_features(df):
# Assumes that the date_time, srch_ci, and srch_co columns have already been converted to datetime.
props = {}
for prop in ['month', 'day', 'hour', 'minute', 'dayofweek', 'quarter']:
props[prop] = getattr(df['date_time'].dt, prop)
carryover = [p for p in df.columns if p not in ['date_time', 'srch_ci', 'srch_co']]
for prop in carryover:
props[prop] = df[prop]
date_props = ['month', 'day', 'dayofweek', 'quarter']
for prop in date_props:
props['ci_{}'.format(prop)] = getattr(df['srch_ci'].dt, prop)
props['co_{}'.format(prop)] = getattr(df['srch_co'].dt, prop)
props['stay_span'] = (df['srch_co'] - df['srch_ci']).astype('timedelta64[h]')
ret = pd.DataFrame(props)
ret = ret.join(dest_small, on='srch_destination_id', how='left', rsuffix='dest')
ret = ret.drop('srch_destination_iddest', axis=1)
return ret
In [63]:
df = calc_fast_features(t1)
Using mean values to fill missing data.
In [74]:
df = df.fillna(df.mean())
In [82]:
predictors = [c for c in df.columns if c not in ['hotel_cluster']]
clf = sklearn.ensemble.RandomForestClassifier(
n_estimators=10,
min_weight_fraction_leaf=0.1,
)
scores = sklearn.cross_validation.cross_val_score(
clf,
df[predictors],
df['hotel_cluster'],
cv=5,
)
scores
Out[82]:
Classifier accuracy seems rather low here as well.
In [107]:
all_probs = []
unique_clusters = df['hotel_cluster'].unique()
for cluster in unique_clusters:
df['target'] = 0
df.loc[df['hotel_cluster'] == cluster, 'target'] = 1
predictors = [c for c in df.columns if c not in ['hotel_cluster', 'target']]
probs = []
cv = sklearn.cross_validation.KFold(len(df), n_folds=5)
clf = sklearn.ensemble.RandomForestClassifier(
n_estimators=10,
min_weight_fraction_leaf=0.1,
)
for i, (tr, te) in enumerate(cv):
clf.fit(df[predictors].iloc[tr], df['target'].iloc[tr])
preds = clf.predict_proba(df[predictors].iloc[te])
probs.append([p[1] for p in preds]) # materialize now; a generator would only be consumed after preds is rebound
full_probs = itertools.chain.from_iterable(probs)
all_probs.append(list(full_probs))
prediction_frame = pd.DataFrame(all_probs).T
prediction_frame.columns = unique_clusters
def find_top5(row):
return list(row.nlargest(5).index)
preds = []
for index, row in prediction_frame.iterrows():
preds.append(find_top5(row))
metrics.mapk([[l] for l in t2['hotel_cluster']], preds, k=5)
Out[107]:
Using just the most popular clusters gives better scores, so the approach here isn't particularly promising. One thing to note is that the input is full of categorical features, so to properly apply machine learning, converting those values to separate binary features may be a more appropriate approach.
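For example, pandas can expand categorical columns into binary indicator columns; a sketch on two arbitrarily picked columns (doing this for high-cardinality IDs would blow up the feature count):

dummies = pd.get_dummies(t1[['site_name', 'srch_destination_type_id']],
                         columns=['site_name', 'srch_destination_type_id'])
dummies.head()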
In [16]:
def make_key(items):
return '_'.join([str(i) for i in items])
In [71]:
match_cols = ['srch_destination_id']
cluster_cols = match_cols + ['hotel_cluster']
groups = t1.groupby(cluster_cols)
In [73]:
top_clusters = {}
for name, group in groups:
bookings = group['is_booking'].sum()
clicks = len(group) - bookings
score = bookings + .15*clicks
clus_name = make_key(name[:len(match_cols)])
if clus_name not in top_clusters:
top_clusters[clus_name] = {}
top_clusters[clus_name][name[-1]] = score
This dictionary is keyed by srch_destination_id, and each value is another dictionary with hotel clusters as keys and scores as values.
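For example, peeking at a single entry shows the structure (the actual IDs and scores will differ):

next(iter(top_clusters.items()))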
Finding the top 5 for each destination.
In [19]:
cluster_dict = {}
for n in top_clusters:
tc = top_clusters[n]
top = [l[0] for l in sorted(tc.items(), key=operator.itemgetter(1), reverse=True)[:5]]
cluster_dict[n] = top
In [20]:
preds = []
for index, row in t2.iterrows():
key = make_key([row[m] for m in match_cols])
if key in cluster_dict:
preds.append(cluster_dict[key])
else:
preds.append(most_common_clusters)
metrics.mapk([[l] for l in t2["hotel_cluster"]], preds, k=5)
Out[20]:
In [ ]:
cluster_dict
In [41]:
match_cols = [
'user_location_country',
'user_location_region',
'user_location_city',
'hotel_market',
'orig_destination_distance',
]
groups = t1.groupby(match_cols)
def generate_exact_matches(row, match_cols):
index = tuple(row[t] for t in match_cols)
try:
group = groups.get_group(index)
except KeyError:
return []
clus = list(set(group.hotel_cluster))
return clus
exact_matches = []
for i in range(t2.shape[0]):
exact_matches.append(generate_exact_matches(t2.iloc[i], match_cols))
In [43]:
def f5(seq, idfun=None):
"""Uniquify a list by Peter Bengtsson
https://www.peterbe.com/plog/uniqifiers-benchmark
"""
if idfun is None:
def idfun(x):
return x
seen = {}
result = []
for item in seq:
marker = idfun(item)
if marker in seen:
continue
seen[marker] = 1
result.append(item)
return result
In [44]:
full_preds = [
f5(exact_matches[p] + preds[p] + most_common_clusters)[:5]
for p
in range(len(preds))
]
metrics.mapk([[l] for l in t2["hotel_cluster"]], full_preds, k=5)
Out[44]:
In [56]:
write_p = [" ".join([str(l) for l in p]) for p in full_preds]
write_frame = ["{},{}".format(t2.index[i], write_p[i]) for i in range(len(full_preds))]
write_frame = ["id,hotel_clusters"] + write_frame
with open('predictions.csv', 'w+') as f:
f.write('\n'.join(write_frame))
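A sketch of an equivalent way to write the same file with pandas (assuming, as above, that t2.index serves as the id column):

sub = pd.DataFrame({'id': t2.index, 'hotel_cluster': write_p},
                   columns=['id', 'hotel_cluster'])
sub.to_csv('predictions.csv', index=False)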
-- Peeter Piksarv